MarkAliasesPrepare applies bookend from inputs as well as outputs. #2815
Conversation
cc @jjsjann123 in case you need it for experiments while I'm cleaning this up for review.
!build
Looks great! Thanks a lot for the improvement, and for leaving the refactor out to make the review easy.
Before:
if (std::all_of(first_user, last_user, [](const Use& use) {
  return ir_utils::isSegmentSet(use.user);
})) {

After:
if (std::all_of(users.begin(), users.end(), ir_utils::isSegmentSet)) {
👏
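For readers skimming the diff, a minimal self-contained sketch of why the simplification is equivalent; the `Expr` type and predicate below are stand-ins, not nvFuser's real classes. Once `users` holds the user expressions directly, the predicate fits `std::all_of` without a lambda wrapper.

#include <algorithm>
#include <vector>

struct Expr {
  bool is_segment_set = false;
};

// Stand-in for ir_utils::isSegmentSet: true iff the expression is a segment_set.
bool isSegmentSet(const Expr* e) {
  return e->is_segment_set;
}

bool allUsersAreSegmentSets(const std::vector<Expr*>& users) {
  // No lambda needed: each element is already the Expr* the predicate takes.
  return std::all_of(users.begin(), users.end(), isSegmentSet);
}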
use_of->uses().size()) {
std::vector<Expr*> users;
users.reserve(use_of->uses().size());
// `uses_to_segment` is sorted, so `nullptr`, if present, appears first.
nitpick: uses_to_segment only exists at the caller's scope.
// There are a few corner cases where we can avoid adding a
// `segment_set`. If a segment_set is to be added between `use_of` and all
// its users, ...
if (users.size() == use_of->uses().size()) {
The reason we check for this is to use segment_set to avoid horizontal fusion of common consumers of a fusion input?
This is for cases like:

in --> [meta 0] -> out0
 |
 +---> [meta 1] -> out1

No `segment_set`s are needed at all.
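A compilable sketch of that corner case; the types and the helper name are hypothetical, not nvFuser's actual pass code. If a segment_set would separate `use_of` from every one of its users, the resulting segmentation is identical to inserting none, so the pass can skip them entirely.

#include <cstddef>
#include <vector>

struct Expr {};

// Returns true when all uses of a value would be fenced off, i.e., fencing
// is a no-op (the in -> [meta] -> out case in the diagram above).
bool canSkipSegmentSets(const std::vector<Expr*>& users_to_segment,
                        std::size_t total_num_uses) {
  return users_to_segment.size() == total_num_uses;
}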
// The following emulates the bookend optimization. This is done in two
// steps: the first step bookends the outputs and the second step does the
// inputs. TODO(wujingyue): extract this into a function. I'm adding the new
// logic in place just to make review easier.
🙇
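A hedged sketch of the two-step structure the comment describes; `Fusion` and both helpers are hypothetical stand-ins, not nvFuser's actual API.

struct Fusion {};

// Step 1 (#2639): segment off meta ops adjacent to the outputs.
void bookendOutputs(Fusion* /*fusion*/) {}

// Step 2 (this PR): segment off meta ops adjacent to the inputs.
void bookendInputs(Fusion* /*fusion*/) {}

void emulateBookend(Fusion* fusion) {
  bookendOutputs(fusion);
  bookendInputs(fusion);
}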
// +--> reshape_1 ----+
//
// If we separate `reshape_0` and `reshape_1` from `mul`, the pointwise
// kernel would take double the input.
Just want to point out that "double the input" means double the number of TensorViews passed to the kernel; during kernel execution, the redundant reads might be served from cache. I agree this is still an overall good thing to have in the reshape example.
However, this would stop us from bookending multiple slices of the same tensor. Arguably, in that case we aren't really duplicating the memory buffer if the slices don't overlap.
Not pushing to change this in the PR. Can we add a comment on that?
NVM, I see you special-case SliceOp down below.
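For readers following along, a hedged sketch of what such a special case could look like; `isMetaOp` and the overall predicate are assumptions, with only SliceOp named in the discussion. The idea: slices are excluded from input-side bookending because non-overlapping slices of the same tensor don't duplicate reads.

struct Expr {
  virtual ~Expr() = default;
};
struct SliceOp : Expr {};

// Stub for illustration: whether `use` is a meta op (reshape, permute, ...).
bool isMetaOp(const Expr* /*use*/) { return true; }

// Hypothetical predicate: bookend a use from the input side unless it is a
// SliceOp, since disjoint slices don't duplicate the underlying buffer.
bool shouldBookendFromInput(const Expr* use) {
  return isMetaOp(use) && dynamic_cast<const SliceOp*>(use) == nullptr;
}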
This PR as is will slow down ...

Without this PR, CUDA kernel: https://gist.github.com/wujingyue/fb8ab7f75eb6cb6f3a33bde4aaf652fa
With this PR, CUDA kernel: https://gist.github.com/wujingyue/22ee624a6a387d29885ab480864c5c86
I suspect the regression is caused by vectorization.
Without the PR: (profiler screenshot)
With the PR: (profiler screenshot)
cc @liqiangxl
3ff26bc should fix the vectorization-factor problem. I'll try to separate it out into a different PR. Thanks @jjsjann123 for the help offline!
!build
May I get notes from your offline discussion? I just want to understand why this PR influences vectorization. Thanks!
See #2854. Before that PR, nvFuser had trouble vectorizing the input of shape ...
@kevinstephano, @tfogal, and @jjsjann123: I'd like to hear your opinions on this. As said in the description, this PR is mostly performance neutral but does slow down test_nanogpt_layer_norm.
This is incredibly timely! I just added a PR for some NeVA benchmarks: Lightning-AI/lightning-thunder#1064. How does this impact ...?
My feeling is that if this improves those 3 graphs, I'd argue it's a clear "merge now". Those are real-world cases on a model we care about in the near term.
For #2599.
@tfogal That being said, I found more interesting results when benchmarking ...
I've yet to figure out the reason behind the slowdown in backprop. The complete fusion looks much more complicated. I noticed some inefficiencies, e.g., tensor 18, an fp32 tensor, will be segmented with this PR. However, none of them sounds significant enough to explain the 1.4x.
Anyhow, I don't think this PR is a clear win, and I'd probably wait until we understand RoPE backprop better.
Thanks for your analysis! I appreciate you thinking to disable the cat executor too; I should have thought to mention that earlier. It would be great, of course, to understand the 1.4x slowdown and endeavor to get this either neutral or beneficial in all cases. But I think that's also a high bar, and if it's not impacting the real-world ...
Sounds good! But if you start getting merge difficulties, I don't think it would be dire to merge it, at least if you think this is a helpful intermediate step towards ... or ...
😢
QQ: what's the baseline for the comparison? I can't see it clearly in the graph, but did the alias pass in this PR remove the permutation on the allocation domain of the fusion's outputs?
The comparison has always been between with and without this PR. However, in the latest comparison, I turned off the cat executor for both the baseline and the test.
This is a follow-up to #2639: that PR bookends from the outputs, and this PR bookends from the inputs.
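To illustrate the two directions with a hypothetical fusion (not one taken from this PR):

// in --> permute --> add --> reshape --> out

#2639 splits the trailing reshape off at the output side; this PR additionally splits the leading permute off at the input side, so only the compute (add) is left for codegen and the meta ops can be handled as aliases.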
Fixes #2599.
Fixes #2577.
Benchmark results are mostly neutral.
test_nanogpt_layer_norm got slower because of the added host latency. This is expected: compared with #2639, the latency of ops bookended from the inputs is more likely to appear on the critical path. See the benchmark details below: https://gist.github.com/wujingyue/fc11f9725bc510827d9c53061ca05898
0027 -- without the PR
0028 -- with the PR
Below are the nvprof traces of test_nanogpt_layer_norm[inference-thunder] without and with this PR.

Without the PR: (nvprof trace screenshot)
With the PR: (nvprof trace screenshot)